Abstract

Parameter ensembles, or sets of point estimates, constitute one of the cornerstones of modern
statistical practice. This is especially the case in Bayesian hierarchical models, where
different decision-theoretic frameworks can be deployed to summarize such parameter ensembles.
The estimation of these parameter ensembles may thus vary substantially depending on which
inferential goals are prioritised by the modeller. In this note, we consider the problem of
classifying the elements of a parameter ensemble above or below a given threshold. Two threshold
classification losses (TCLs), weighted and unweighted, are formulated. The weighted TCL can be
used to penalize false positives more heavily than false negatives, or the converse. We prove
that the weighted and unweighted TCLs are optimized by the ensembles of unit-specific posterior
quantiles and posterior medians, respectively. In addition, we relate these classification loss
functions on parameter ensembles to the concepts of posterior sensitivity and specificity.
Finally, we find some relationships between the unweighted TCL and the absolute value loss,
which explain why both functions are minimized by posterior medians.

KEYWORDS: Bayesian Statistics, Classification, Decision Theory, Epidemiology, Hierarchical 
Model, Loss Function, Parameter Ensemble, Sensitivity, Specificity. 

1 Introduction 

The problem of the optimal classification of a set of data points into several clusters has occupied 
statisticians and applied mathematicians for several decades (see Gordon, 1999, for an overview).
As is true for all statistical methods, a classification is, above all, a summary of the data at hand. 
When clustering, the statistician is searching for an optimal partition of the parameter space into 
a (generally known or pre-specified) number of classes. The essential ingredient underlying all
classifications is the minimization of some distance function, which generally takes the form of a 
similarity or dissimilarity metric (Gordon, 1999). Optimal classification will then result in a trade- 
off between the level of similarity of the within-cluster elements and the level of dissimilarity of 
the between-cluster elements. In a decision-theoretic framework, such distance functions naturally 
arise through the specification of a loss function for the problem at hand. The task of computing 
the optimal partition of the parameter space then becomes a matter of minimizing the chosen loss 
function. 

In spatial epidemiology, the issue of classifying areas according to their levels of risk has been
previously investigated by Richardson et al. (2004). These authors have shown that areas can
be classified according to the joint posterior distribution of the parameter ensemble of interest.
In particular, a taxonomy can be created by selecting a decision rule $D(\alpha, C_\alpha)$ for
that purpose, where $C_\alpha$ is a particular threshold, above and below which we classify the
areas in the region of interest. The parameter $\alpha$, in this decision rule, is the cut-off
point associated with $C_\alpha$, which determines the amount of probability mass necessary for an
area to be allocated to the above-threshold category. Thus, an area $i$ with level of risk denoted
by $\theta_i$ will be assigned above the threshold $C_\alpha$ if
$\mathbb{P}[\theta_i > C_\alpha \mid y] > \alpha$. Richardson et al. (2004) have therefore
provided a general framework for the classification of areas, according to their levels of risk.
However, this approach is not satisfactory because it relies on the choice of two co-dependent
values, $C_\alpha$ and $\alpha$, which can only be selected in an arbitrary fashion.

Our perspective in this paper follows the framework adopted by Lin et al. (2006), who introduced
several loss functions for the identification of the elements of a parameter ensemble that
represent the proportion of elements with the highest level of risk. Such a classification is based
on a particular rank percentile cut-off denoted $\gamma \in [0,1]$, which determines a group of
high-risk areas. That is, Lin et al. (2006) identified the areas whose percentile rank is above the
cut-off point $\gamma$. Our approach, in this paper, is substantially different since the
classification is based on a real-valued threshold as opposed to a particular rank percentile. In
order to emphasize this distinction, we will refer to our proposed family of loss functions as
threshold classification losses (TCLs).

2 Classification of Elements in a Parameter Ensemble 

We formulate our classification problem within the context of Bayesian hierarchical models 
(BHMs). In its most basic formulation, a BHM is composed of the following two layers of random 
variables, 

$$y_i \sim p(y_i \mid \theta_i, \sigma_i), \qquad g(\boldsymbol{\theta}) \sim p(\boldsymbol{\theta} \mid \xi), \tag{1}$$

for $i = 1, \ldots, n$ and where $g(\cdot)$ is a transformation of $\boldsymbol{\theta}$, which may
be defined as a link function as commonly used in generalised linear models (see McCullagh and
Nelder, 1989). The vector of real-valued parameters,
$\boldsymbol{\theta} := \{\theta_1, \ldots, \theta_n\}$, will be referred to as a parameter ensemble.

2.1 Threshold Classification Loss 

For some cut-off point $C \in \mathbb{R}$, we define the penalties associated with the two
different types of misclassification. Following standard statistical terminology, we will express
such misclassifications in terms of false positives (FPs) and false negatives (FNs). These concepts
are formally described as

$$\mathrm{FP}(C, \theta, \theta^{\mathrm{est}}) := \mathcal{I}\{\theta \le C,\ \theta^{\mathrm{est}} > C\}, \quad \text{and} \quad \mathrm{FN}(C, \theta, \theta^{\mathrm{est}}) := \mathcal{I}\{\theta > C,\ \theta^{\mathrm{est}} \le C\}, \tag{2}$$

where $\theta$ represents the parameter of interest and $\theta^{\mathrm{est}}$ is a candidate
estimate. These correspond to the occurrence of a false positive (type I error) and a false
negative (type II error), respectively.

For the decision problem to be fully specified, we need to choose a loss function based on the
sets of unit-specific FPs and FNs. The $p$-weighted threshold classification loss
($\mathrm{TCL}_p$) function is then defined as

$$\mathrm{TCL}_p(C, \boldsymbol{\theta}, \boldsymbol{\theta}^{\mathrm{est}}) := \frac{1}{n} \sum_{i=1}^{n} p\,\mathrm{FP}(C, \theta_i, \theta_i^{\mathrm{est}}) + (1 - p)\,\mathrm{FN}(C, \theta_i, \theta_i^{\mathrm{est}}). \tag{3}$$

One of the advantages of the choice of $\mathrm{TCL}_p$ for quantifying the misclassifications of
the elements of a parameter ensemble is that it is normalised, in the sense that
$\mathrm{TCL}_p(C, \boldsymbol{\theta}, \boldsymbol{\theta}^{\mathrm{est}}) \in [0, 1]$ for any
choice of $C$ and $p$. Our main result in this paper is the following minimization.

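Proposition 1. For some parameter ensemble $\boldsymbol{\theta}$, and given a real-valued
threshold $C \in \mathbb{R}$ and $p \in [0, 1]$, we have the following optimal estimator under
weighted TCL,

$$\boldsymbol{\theta}^{(1-p)} = \operatorname*{argmin}_{\boldsymbol{\theta}^{\mathrm{est}}} \mathbb{E}\left[\mathrm{TCL}_p(C, \boldsymbol{\theta}, \boldsymbol{\theta}^{\mathrm{est}}) \mid y\right], \tag{4}$$

where $\boldsymbol{\theta}^{(1-p)}$ is the vector of posterior $(1 - p)$-quantiles defined as

$$\boldsymbol{\theta}^{(1-p)} := \left\{Q_{\theta_1 \mid y}(1 - p), \ldots, Q_{\theta_n \mid y}(1 - p)\right\}, \tag{5}$$

where $Q_{\theta_i \mid y}(1 - p)$ denotes the posterior $(1 - p)$-quantile of the $i$th element,
$\theta_i$, in the parameter ensemble. Moreover, $\boldsymbol{\theta}^{(1-p)}$ is not unique.

Ginestet, Best and Richardson

Classification Loss Function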

We prove this result by exhaustion in three cases. The full proof is reported in Appendix A.
Note that the fact that $\mathrm{TCL}_p$ is minimized by $\boldsymbol{\theta}^{(1-p)}$ and not
$\boldsymbol{\theta}^{(p)}$ is solely a consequence of our choice of definition for the
$\mathrm{TCL}_p$ function. If the weighting of the FPs and FNs had been $(1 - p)$ and $p$,
respectively, then the optimal minimizer of that function would indeed be a vector of posterior
$p$-quantiles.
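
As an informal numerical illustration of Proposition 1, the following sketch approximates the
posterior expected weighted TCL from posterior draws and compares the ensemble of empirical
(1 - p)-quantiles with the ensemble of medians. It is only a sketch under stated assumptions:
the posterior draws are simulated from hypothetical normal posteriors rather than produced by a
fitted hierarchical model, and the names `expected_tcl_p` and `posterior_quantiles` are
illustrative helpers that do not appear in the original text.

```python
import math
import random


def expected_tcl_p(samples, estimates, C, p):
    """Monte Carlo estimate of E[TCL_p(C, theta, theta_est) | y].

    samples[i] holds posterior draws of theta_i (assumed already available),
    estimates[i] is the candidate point estimate for unit i.
    """
    total = 0.0
    for draws, est in zip(samples, estimates):
        prob_below = sum(d <= C for d in draws) / len(draws)  # P[theta_i <= C | y]
        if est > C:
            total += p * prob_below  # expected false-positive penalty
        else:
            total += (1 - p) * (1 - prob_below)  # expected false-negative penalty
    return total / len(samples)


def posterior_quantiles(samples, q):
    """Empirical q-quantile per unit: the smallest draw x with F_hat(x) >= q."""
    out = []
    for draws in samples:
        ordered = sorted(draws)
        k = max(math.ceil(q * len(ordered)) - 1, 0)
        out.append(ordered[k])
    return out


# Hypothetical posterior draws for n = 5 units (illustrative only).
rng = random.Random(0)
samples = [[rng.gauss(mu, 0.5) for _ in range(2000)] for mu in (0.6, 0.9, 1.0, 1.2, 1.5)]
C, p = 1.0, 0.2

risk_quantile = expected_tcl_p(samples, posterior_quantiles(samples, 1 - p), C, p)
risk_median = expected_tcl_p(samples, posterior_quantiles(samples, 0.5), C, p)
print(risk_quantile, risk_median)
```

With respect to the empirical posterior, the risk of the quantile ensemble should never exceed
that of any other candidate ensemble, up to floating-point error. Since only the side of C on
which each estimate falls matters, many other ensembles attain the same risk, which is the
non-uniqueness noted in the proposition.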

2.2 Unweighted Threshold Classification Loss 

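We now specialize this result to the unweighted TCL family, which is defined analogously to
equation (3), as follows,

$$\mathrm{TCL}(C, \boldsymbol{\theta}, \boldsymbol{\theta}^{\mathrm{est}}) := \frac{1}{n} \sum_{i=1}^{n} \mathrm{FP}(C, \theta_i, \theta_i^{\mathrm{est}}) + \mathrm{FN}(C, \theta_i, \theta_i^{\mathrm{est}}). \tag{6}$$

The minimizer of this loss function can be shown to be trivially equivalent to the minimizer of
$\mathrm{TCL}_{0.5}$, because $\mathrm{TCL} = 2\,\mathrm{TCL}_{0.5}$. That is, we have

$$\operatorname*{argmin}_{\boldsymbol{\theta}^{\mathrm{est}}} \mathbb{E}\left[\mathrm{TCL}(C, \boldsymbol{\theta}, \boldsymbol{\theta}^{\mathrm{est}}) \mid y\right] = \operatorname*{argmin}_{\boldsymbol{\theta}^{\mathrm{est}}} \mathbb{E}\left[\mathrm{TCL}_{0.5}(C, \boldsymbol{\theta}, \boldsymbol{\theta}^{\mathrm{est}}) \mid y\right], \tag{7}$$

for every $C$, which therefore proves the following corollary.

Corollary 1. For some parameter ensemble $\boldsymbol{\theta}$ and $C \in \mathbb{R}$, the
minimizer of the posterior expected TCL is

$$\boldsymbol{\theta}^{\mathrm{med}} := \boldsymbol{\theta}^{(0.5)} = \left\{Q_{\theta_1 \mid y}(0.5), \ldots, Q_{\theta_n \mid y}(0.5)\right\}, \tag{8}$$

and this optimal estimator is not unique.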

The posterior expected loss under the unweighted TCL function takes the following form,

$$\mathbb{E}\left[\mathrm{TCL}(C, \boldsymbol{\theta}, \boldsymbol{\theta}^{\mathrm{est}}) \mid y\right] = \frac{1}{n} \sum_{i=1}^{n} \left( \int_{-\infty}^{C} d\mathbb{P}[\theta_i \mid y]\, \mathcal{I}\{\theta_i^{\mathrm{est}} > C\} + \int_{C}^{+\infty} d\mathbb{P}[\theta_i \mid y]\, \mathcal{I}\{\theta_i^{\mathrm{est}} \le C\} \right), \tag{9}$$

whose formula is derived using
$\mathcal{I}\{\theta \le C, \theta^{\mathrm{est}} > C\} = \mathcal{I}\{\theta \le C\}\,\mathcal{I}\{\theta^{\mathrm{est}} > C\}$.
It is of special importance to note that when using the posterior TCL, any classification, correct
or incorrect, will incur a penalty. The size of that penalty, however, varies substantially
depending on whether or not the classification is correct. A true positive can be distinguished
from a false positive by the fact that the former will only incur a small penalty, proportional to
the posterior probability that the parameter lies below the chosen cut-off point $C$.
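
The unit-specific penalty discussed above can be sketched numerically as follows. This is a
minimal illustration, assuming that a handful of posterior draws for a single unit are available;
the draws, the threshold and the helper `unit_penalty` are hypothetical and chosen only to make
the arithmetic easy to follow.

```python
import statistics


def unit_penalty(draws, est, C):
    """Posterior expected FP + FN penalty for a single unit (one summand of (9))."""
    prob_below = sum(d <= C for d in draws) / len(draws)  # P[theta_i <= C | y]
    # Classifying above C costs the mass below C, and vice versa.
    return prob_below if est > C else 1.0 - prob_below


# Hypothetical posterior draws for one unit (illustrative only).
draws = [0.8, 1.1, 1.2, 1.3, 1.4]
C = 1.0
median = statistics.median(draws)

print(unit_penalty(draws, median, C))  # penalty when classified at the median
print(unit_penalty(draws, 0.9, C))     # penalty when classified below C
```

Here four of the five draws lie above the threshold, so classifying the unit above C incurs only
the mass below C as a penalty, whereas classifying it below C incurs the complementary, larger
mass. The median falls on the side of C carrying the larger posterior mass, in line with
Corollary 1.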



2.3 Relationship with Posterior Sensitivity and Specificity 

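Our chosen decision-theoretic framework for classification has the added benefit of being readily
comparable to conventional measures of classification errors widely used in the context of test
theory. For our purpose, we will define the Bayesian sensitivity of a classification estimator
$\boldsymbol{\theta}^{\mathrm{est}}$, also referred to as the posterior true positive rate (TPR),
as follows

$$\mathrm{TPR}(C, \boldsymbol{\theta}, \boldsymbol{\theta}^{\mathrm{est}}) := \frac{\sum_{i=1}^{n} \mathbb{E}\left[\mathrm{TP}(C, \theta_i, \theta_i^{\mathrm{est}}) \mid y\right]}{\sum_{i=1}^{n} \mathbb{P}[\theta_i > C \mid y]}, \tag{10}$$

where the expectations are taken with respect to the joint posterior distribution of
$\boldsymbol{\theta}$. Similarly, the Bayesian specificity, or posterior true negative rate (TNR),
will be defined as

$$\mathrm{TNR}(C, \boldsymbol{\theta}, \boldsymbol{\theta}^{\mathrm{est}}) := \frac{\sum_{i=1}^{n} \mathbb{E}\left[\mathrm{TN}(C, \theta_i, \theta_i^{\mathrm{est}}) \mid y\right]}{\sum_{i=1}^{n} \mathbb{P}[\theta_i \le C \mid y]},$$

where in both definitions, we have used
$\mathrm{TP}(C, \theta_i, \theta_i^{\mathrm{est}}) := \mathcal{I}\{\theta_i > C, \theta_i^{\mathrm{est}} > C\}$ and
$\mathrm{TN}(C, \theta_i, \theta_i^{\mathrm{est}}) := \mathcal{I}\{\theta_i \le C, \theta_i^{\mathrm{est}} \le C\}$.
It then follows that we can formulate the relationship between the posterior expected TCL and the
Bayesian sensitivity and specificity as

$$\mathbb{E}\left[\mathrm{TCL}(C, \boldsymbol{\theta}, \boldsymbol{\theta}^{\mathrm{est}}) \mid y\right] = \frac{1}{n}\,\mathrm{FPR}(C, \boldsymbol{\theta}, \boldsymbol{\theta}^{\mathrm{est}}) \sum_{i=1}^{n} \mathbb{P}[\theta_i \le C \mid y] + \frac{1}{n}\,\mathrm{FNR}(C, \boldsymbol{\theta}, \boldsymbol{\theta}^{\mathrm{est}}) \sum_{i=1}^{n} \mathbb{P}[\theta_i > C \mid y],$$

where $\mathrm{FPR}(C, \boldsymbol{\theta}, \boldsymbol{\theta}^{\mathrm{est}}) := 1 - \mathrm{TNR}(C, \boldsymbol{\theta}, \boldsymbol{\theta}^{\mathrm{est}})$
and $\mathrm{FNR}(C, \boldsymbol{\theta}, \boldsymbol{\theta}^{\mathrm{est}}) := 1 - \mathrm{TPR}(C, \boldsymbol{\theta}, \boldsymbol{\theta}^{\mathrm{est}})$.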
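
The relationship between the posterior expected TCL and the posterior TPR and TNR can be checked
numerically. The sketch below is illustrative only: the posterior draws for three units are
hypothetical, the posterior medians are used as estimates, and the helpers `posterior_rates` and
`expected_tcl` are names introduced here rather than part of the original text. It also assumes
that both denominators in the definitions of the TPR and TNR are strictly positive.

```python
import statistics


def posterior_rates(samples, estimates, C):
    """Return (TPR, TNR, sum_i P[theta_i > C | y], sum_i P[theta_i <= C | y]).

    Assumes both sums are strictly positive, so that the rates are defined.
    """
    tp = tn = pos = neg = 0.0
    for draws, est in zip(samples, estimates):
        prob_above = sum(d > C for d in draws) / len(draws)
        pos += prob_above
        neg += 1.0 - prob_above
        if est > C:
            tp += prob_above        # E[TP | y] = I{est > C} P[theta_i > C | y]
        else:
            tn += 1.0 - prob_above  # E[TN | y] = I{est <= C} P[theta_i <= C | y]
    return tp / pos, tn / neg, pos, neg


def expected_tcl(samples, estimates, C):
    """Posterior expected unweighted TCL, computed directly from its definition."""
    total = 0.0
    for draws, est in zip(samples, estimates):
        prob_above = sum(d > C for d in draws) / len(draws)
        total += (1.0 - prob_above) if est > C else prob_above
    return total / len(samples)


# Hypothetical posterior draws for n = 3 units (illustrative only).
samples = [[0.5, 0.7, 1.2, 1.4], [0.9, 1.1, 1.3, 1.6], [0.2, 0.4, 0.6, 1.1]]
C = 1.0
estimates = [statistics.median(d) for d in samples]

tpr, tnr, pos, neg = posterior_rates(samples, estimates, C)
direct = expected_tcl(samples, estimates, C)
via_rates = ((1.0 - tnr) * neg + (1.0 - tpr) * pos) / len(samples)
print(tpr, tnr, direct, via_rates)
```

The two quantities `direct` and `via_rates` should agree up to floating-point error, since the
second is simply the first rewritten in terms of the false positive and false negative rates.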



3 Conclusion 

The fact that the posterior median is the minimizer of the posterior expected absolute value loss
(AVL) function is well known (Berger, 1980). That is, the posterior median minimizes the posterior
expected AVL, where $\mathrm{AVL}(\theta, \theta^{\mathrm{est}}) := |\theta - \theta^{\mathrm{est}}|$.
One may therefore ask whether there is a link between the minimization of the AVL function, which
is an estimation loss, and the classification loss function described in this paper. The proof of
the optimality of the posterior median under AVL proceeds by considering whether
$\theta^{\mathrm{med}} - \theta^{\mathrm{est}} \gtreqless 0$. This leads to a proof by exhaustion
in three cases, which includes the trivial case where $\theta^{\mathrm{med}}$ and
$\theta^{\mathrm{est}}$ are equal. Similarly, in the proof of Proposition 1, we have also obtained
three cases, which are based on the relationships between the $\theta_i^{(1-p)}$'s and the
$\theta_i^{\mathrm{est}}$'s with respect to $C$. However, note that by subtracting
$\theta_i^{(1-p)} \le C$ from $C < \theta_i^{\mathrm{est}}$ and ignoring null sets, we obtain
$\theta_i^{(1-p)} - \theta_i^{\mathrm{est}} < 0$ for the second case. Similarly, a subtraction of
the hypotheses of the third case gives $\theta_i^{(1-p)} - \theta_i^{\mathrm{est}} > 0$, which
therefore highlights the relationship between the optimization of the AVL and the TCL functions.
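
The shared role of the posterior median under both losses can be illustrated with a small
numerical sketch. It assumes a single unit with five hypothetical posterior draws and a short,
arbitrary list of candidate estimates; `expected_avl` and `expected_unit_tcl` are illustrative
helper names that do not appear in the original text.

```python
import statistics


def expected_avl(draws, est):
    """Posterior expected absolute value loss, approximated by the draws."""
    return sum(abs(d - est) for d in draws) / len(draws)


def expected_unit_tcl(draws, est, C):
    """Posterior expected unweighted TCL penalty for a single unit."""
    prob_below = sum(d <= C for d in draws) / len(draws)
    return prob_below if est > C else 1.0 - prob_below


draws = [0.2, 0.9, 1.1, 1.6, 3.0]  # hypothetical posterior draws
C = 1.0
median = statistics.median(draws)
candidates = [0.5, 0.9, 1.0, median, 1.5, 2.0]

best_avl = min(candidates, key=lambda e: expected_avl(draws, e))
lowest_tcl = min(expected_unit_tcl(draws, e, C) for e in candidates)
print(best_avl, lowest_tcl, expected_unit_tcl(draws, median, C))
```

Among these candidates, the median is the unique minimizer of the expected AVL, whereas under the
TCL it merely attains the minimum: every candidate on the same side of C attains it as well.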
Appendix A: Proof of TCL Minimization

Proof of Proposition 1.

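Let $\rho_p(C, \boldsymbol{\theta}, \boldsymbol{\theta}^{\mathrm{est}})$ denote
$\mathbb{E}[\mathrm{TCL}_p(C, \boldsymbol{\theta}, \boldsymbol{\theta}^{\mathrm{est}}) \mid y]$. We
prove the result by exhaustion over three cases. In order to prove that

$$\rho_p(C, \boldsymbol{\theta}, \boldsymbol{\theta}^{(1-p)}) \le \rho_p(C, \boldsymbol{\theta}, \boldsymbol{\theta}^{\mathrm{est}}), \tag{12}$$

for any $\boldsymbol{\theta}^{\mathrm{est}} \in \Theta$ with
$\theta_i^{(1-p)} := Q_{\theta_i \mid y}(1 - p)$, it suffices to show that
$\rho_p(C, \theta_i, \theta_i^{(1-p)}) \le \rho_p(C, \theta_i, \theta_i^{\mathrm{est}})$ holds, for
every $i = 1, \ldots, n$. Expanding these unit-specific risks,

$$\begin{aligned} & p\,\mathcal{I}\{\theta_i^{(1-p)} > C\}\,\mathbb{P}[\theta_i \le C \mid y] + (1 - p)\,\mathcal{I}\{\theta_i^{(1-p)} \le C\}\,\mathbb{P}[\theta_i > C \mid y] \\ & \quad \le p\,\mathcal{I}\{\theta_i^{\mathrm{est}} > C\}\,\mathbb{P}[\theta_i \le C \mid y] + (1 - p)\,\mathcal{I}\{\theta_i^{\mathrm{est}} \le C\}\,\mathbb{P}[\theta_i > C \mid y]. \end{aligned} \tag{13}$$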

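Now, fix $C$ and $p \in [0, 1]$ to arbitrary values. Then, for any point estimate
$\theta_i^{\mathrm{est}}$, we have

$$\rho_p(C, \theta_i, \theta_i^{\mathrm{est}}) = \begin{cases} p\,\mathbb{P}[\theta_i \le C \mid y], & \text{if } \theta_i^{\mathrm{est}} > C, \\ (1 - p)\,\mathbb{P}[\theta_i > C \mid y], & \text{if } \theta_i^{\mathrm{est}} \le C. \end{cases} \tag{14}$$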



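The optimality of $\theta_i^{(1-p)}$ over $\theta_i^{\mathrm{est}}$ as a point estimate is
therefore directly dependent on the relationships between $\theta_i^{(1-p)}$ and $C$, and between
$\theta_i^{\mathrm{est}}$ and $C$. This determines the following three cases:

i. If $\theta_i^{(1-p)}$ and $\theta_i^{\mathrm{est}}$ are on the same side of $C$, then clearly,

$$\rho_p(C, \theta_i, \theta_i^{(1-p)}) = \rho_p(C, \theta_i, \theta_i^{\mathrm{est}}), \tag{15}$$

ii. If $\theta_i^{(1-p)} \le C$ and $\theta_i^{\mathrm{est}} > C$, then

$$\rho_p(C, \theta_i, \theta_i^{(1-p)}) = (1 - p)\,\mathbb{P}[\theta_i > C \mid y] \le p\,\mathbb{P}[\theta_i \le C \mid y] = \rho_p(C, \theta_i, \theta_i^{\mathrm{est}}), \tag{16}$$

iii. If $\theta_i^{(1-p)} > C$ and $\theta_i^{\mathrm{est}} \le C$, then

$$\rho_p(C, \theta_i, \theta_i^{(1-p)}) = p\,\mathbb{P}[\theta_i \le C \mid y] < (1 - p)\,\mathbb{P}[\theta_i > C \mid y] = \rho_p(C, \theta_i, \theta_i^{\mathrm{est}}). \tag{17}$$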

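Equation (15) follows directly from an application of the result in (13), and cases two and three
follow from consideration of the following relationship:

$$p\,\mathbb{P}[\theta_i \le C \mid y] \gtreqless (1 - p)\,\mathbb{P}[\theta_i > C \mid y], \tag{18}$$

where $\gtreqless$ means either $<$, $=$ or $>$. Using
$\mathbb{P}[\theta_i > C \mid y] = 1 - \mathbb{P}[\theta_i \le C \mid y]$, this gives

$$\mathbb{P}[\theta_i \le C \mid y] = F_{\theta_i \mid y}(C) \gtreqless 1 - p. \tag{19}$$

Here, $F_{\theta_i \mid y}$ is the posterior CDF of $\theta_i$. Therefore, we have

$$C \gtreqless F_{\theta_i \mid y}^{-1}(1 - p) =: Q_{\theta_i \mid y}(1 - p) =: \theta_i^{(1-p)}, \tag{20}$$

where $\gtreqless$ takes the same value in equations (18), (19) and (20).

This proves the optimality of $\boldsymbol{\theta}^{(1-p)}$. Moreover, since one can construct a
vector of point estimates $\theta_i^{\mathrm{est}}$ satisfying
$\theta_i^{\mathrm{est}} \gtreqless C$, whenever $\theta_i^{(1-p)} \gtreqless C$, for every $i$, it
then follows that $\boldsymbol{\theta}^{(1-p)}$ is not unique.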
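
The chain of equivalent comparisons in (18), (19) and (20) can be checked numerically for a
posterior with a continuous, strictly increasing CDF. The sketch below assumes a hypothetical
normal posterior for a single unit, using `statistics.NormalDist` from the Python standard
library; the helper names `sign` and `relation_signs`, as well as the chosen values of the
threshold and of p, are illustrative and not part of the original proof.

```python
from statistics import NormalDist


def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def relation_signs(posterior, C, p):
    """Signs of the comparisons in (18), (19) and (20) for one unit."""
    F = posterior.cdf(C)                       # P[theta_i <= C | y]
    s18 = sign(p * F - (1 - p) * (1 - F))      # p P[<= C] versus (1 - p) P[> C]
    s19 = sign(F - (1 - p))                    # F(C) versus 1 - p
    s20 = sign(C - posterior.inv_cdf(1 - p))   # C versus the (1 - p)-quantile
    return s18, s19, s20


posterior = NormalDist(mu=1.0, sigma=0.5)  # hypothetical posterior of theta_i
for C in (0.5, 2.0):
    print(C, relation_signs(posterior, C, p=0.3))
```

For each threshold, the three signs should coincide: when C lies below the (1 - p)-quantile,
classifying the unit above C is the cheaper option, and conversely when C lies above it.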